Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks
Authors
Peter L. Bartlett, David P. Helmbold, Philip M. Long
Abstract
We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e., algorithms that learn a function $h$ parameterized by matrices $\Theta_1, \ldots, \Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} \cdots \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the optimum, in the case where the initial hypothesis $\Theta_1 = \cdots = \Theta_L = I$ has loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $\Phi$ is symmetric positive definite, we show that an algorithm that initializes $\Theta_i = I$ learns an $\epsilon$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $\Phi$, and $\log(d/\epsilon)$. In contrast, we show that if the target $\Phi$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that $\Phi$ satisfies $u^{\top} \Phi u > 0$ for all $u$ but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^{\top} \Theta_L \Theta_{L-1} \cdots \Theta_1 u > 0$ for all $u$, and another that "balances" $\Theta_1, \ldots, \Theta_L$ so that they have the same singular values.
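For intuition, here is a minimal numerical sketch of the setting above, assuming isotropic inputs so that the population loss reduces to $\frac{1}{2}\|\Theta_L \cdots \Theta_1 - \Phi\|_F^2$ and each layer's gradient has a closed form; the function name, step size, depth, and stopping rule are illustrative assumptions, not the paper's prescribed values.

```python
import numpy as np

def deep_linear_gd(phi, L=4, eta=0.05, steps=2000, tol=1e-8):
    """Gradient descent with identity initialization on a deep linear net.

    With isotropic inputs, E[x x^T] = I, the population quadratic loss
    (1/2) E||Theta_L ... Theta_1 x - Phi x||^2 equals
    (1/2) ||Theta_L ... Theta_1 - Phi||_F^2, so no sampling is needed.
    """
    d = phi.shape[0]
    thetas = [np.eye(d) for _ in range(L)]  # identity initialization
    loss = np.inf
    for _ in range(steps):
        # Prefix products P[i] = Theta_i ... Theta_1 (P[0] = I) and
        # suffix products S[i] = Theta_L ... Theta_{i+1} (S[L] = I).
        P = [np.eye(d)]
        for T in thetas:
            P.append(T @ P[-1])
        S = [None] * (L + 1)
        S[L] = np.eye(d)
        for i in range(L - 1, -1, -1):
            S[i] = S[i + 1] @ thetas[i]
        resid = P[L] - phi  # end-to-end map minus target
        loss = 0.5 * np.sum(resid ** 2)
        if loss < tol:
            break
        # dLoss/dTheta_{i+1} = (Theta_L..Theta_{i+2})^T resid (Theta_i..Theta_1)^T
        grads = [S[i + 1].T @ resid @ P[i].T for i in range(L)]
        thetas = [T - eta * g for T, g in zip(thetas, grads)]
    return thetas, loss

# Example: a symmetric positive definite target near the identity.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
phi = np.eye(3) + 0.1 * (B + B.T)  # SPD when the perturbation is small
thetas, loss = deep_linear_gd(phi)
print(loss)  # small: the product of the layers approximates Phi
```

Note that this toy run only exercises the regime covered by the paper's positive result: a target whose distance from the identity is small.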
Similar papers
On the importance of initialization and momentum in deep learning
Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train...
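As a minimal sketch of the kind of update this abstract refers to (classical momentum with a slowly increasing momentum coefficient; the constants, schedule, and toy objective below are illustrative assumptions, not necessarily the paper's exact choices):

```python
import numpy as np

def momentum_sgd(grad_fn, theta0, eta=0.01, steps=1000, mu_max=0.99):
    """Classical momentum: v <- mu*v - eta*grad, theta <- theta + v,
    with the momentum coefficient mu slowly increased over training."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        # Anneal mu upward toward mu_max (illustrative schedule).
        mu = min(1.0 - 0.5 / (t // 250 + 1), mu_max)
        v = mu * v - eta * grad_fn(theta)
        theta = theta + v
    return theta

# Example: minimize the quadratic f(theta) = 0.5 * ||theta||^2.
print(momentum_sgd(lambda th: th, np.ones(5)))  # near zero
```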
Orthogonal and Idempotent Transformations for Learning Deep Neural Networks
Identity transformations, used as skip-connections in residual networks, directly connect convolutional layers close to the input and those close to the output in deep neural networks, improving information flow and thus easing the training. In this paper, we introduce two alternative linear transforms, orthogonal transformation and idempotent transformation. According to the definition and pro...
Rectified linear neural networks with tied-scalar regularization for LVCSR
It is known that rectified linear deep neural networks (RL-DNNs) can consistently outperform conventional pretrained sigmoid DNNs, even with a random initialization. In this paper, we present another interesting and useful property of RL-DNNs: they can be learned with a very large batch size in stochastic gradient descent (SGD). Therefore, the SGD learning can be easily parallelized amon...
An Augmented Conjugate Gradient Method for Solving Consecutive Symmetric Positive Definite Linear Systems
Many scientific applications require the successive solution of linear systems Ax = b with different right-hand sides b and a symmetric positive definite matrix A. The conjugate gradient method applied to the first system generates a Krylov subspace which can be efficiently recycled, thanks to orthogonal projections, in subsequent systems. A modified conjugate gradient method is then applied with ...
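For context, here is a minimal sketch of the plain conjugate gradient iteration for a single SPD system; the recycling of the Krylov subspace across subsequent right-hand sides, which is this paper's contribution, is not shown.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Plain conjugate gradient for a symmetric positive definite A."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Example: a random, well-conditioned SPD system.
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))  # small residual
```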
Handwritten Character Recognition using Modified Gradient Descent Technique of Neural Networks and Representation of Conjugate Descent for Training Patterns
The purpose of this study is to analyze the performance of the backpropagation algorithm with changing training patterns and a second momentum term in feed-forward neural networks. The analysis is conducted on 250 different three-letter lowercase words from the English alphabet. These words are presented to two vertical segmentation programs designed in MATLAB and based on portions (1...
Journal: CoRR
Volume: abs/1802.06093
Year: 2018